Skip to content

table: native scan and Arrow stream integration#2

Draft
abnobdoss wants to merge 10 commits into
mainfrom
aba-158-opt-in-rust-arrow-scan
Draft

table: native scan and Arrow stream integration#2
abnobdoss wants to merge 10 commits into
mainfrom
aba-158-opt-in-rust-arrow-scan

Conversation

@abnobdoss
Copy link
Copy Markdown
Owner

@abnobdoss abnobdoss commented May 25, 2026

Status: active integration branch

Scope:

  • Env-gated native Arrow batch-reader path for PyIceberg scans.
  • Table-property and env mode selector for PyArrow, Rust read, and Rust plan-and-read.
  • Eager to_arrow materializes through the same guarded native batch reader when a native mode is enabled.
  • Arrow PyCapsule stream support for append, overwrite, Table, DataScan, to_polars, and to_duckdb.
  • Native equality-delete payload conversion at the adapter boundary.
  • Opt-in native upsert match-filter route for benchmarking.
  • Reproducible benchmark harness that records wall time, CPU time, RSS, rows, checksum, and fallback status.

Current value proven on the rebased branch:

  • Local 1M rows across 16 files: Rust plan-and-read is correctness parity but not a speed win. Full, projection, and filter are about 0.85x to 0.93x PyArrow wall time and use more RSS.
  • Local planning-heavy 128-file scans: Rust plan-and-read is about 1.32x faster for projection and 1.53x faster for filter, with lower CPU but much higher RSS.
  • Local planning-heavy 512-file spot check: Rust plan-and-read is about 1.49x faster for projection and 1.52x faster for filter, again with much higher RSS.
  • PyCapsule PyArrow handoff is effectively parity with normal PyArrow in this harness; it improves interop but is not a scan-planning speedup by itself.

Verification:

  • Focused pytest suite passed for Arrow capsule, scan backend selection, pyiceberg-core adapters, and native upsert route.
  • Commit hooks passed: trailing whitespace, EOF, AST, Ruff, Ruff format, mypy, pydocstyle, codespell.
  • Local fork pyiceberg-core release build imports the required scan, expression, schema, and file_io modules.

Known limits:

  • Published pyiceberg-core package does not expose the required native scan modules; reviewers need a local fork wheel until upstream packaging catches up.
  • Equality-delete support is adapter-level only so far; end-to-end equality-delete read parity still needs a real equality-delete table fixture.
  • Remote object-store, latency-injected S3, and concurrent workload behavior are not measured yet.
  • Memory amplification is the main production risk in the planning-heavy wins.

No external issue references.

Abanoub Doss and others added 9 commits June 6, 2026 11:44
Fan scan tasks out over a thread pool of native ArrowReaders so decode uses
multiple cores instead of one, streaming batches as they complete with at most
one decoded batch per shard in flight. A default batch size amortizes the
per-batch GIL handoff that otherwise dominates the fan-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A limited scan previously always fell back to PyArrow. Push the limit through the
native reader instead: truncate the streamed result at the limit (slicing the
crossing batch and closing the shards so they stop decoding early) and cap the
batch size to the limit so a small limit does not decode a full batch per shard.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The native reader emits arrow-rs's own types (string rather than large_string,
and run-end-encoded identity-partition columns), so its output diverged from the
PyArrow scan path. Cast every batch to schema_to_pyarrow(projected_schema),
decoding run-end-encoded columns first since there is no direct cast kernel for
them, so the native path is a faithful drop-in.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
PYICEBERG_RUST_SCAN_PLANNING plans the scan in pyiceberg-core (Table.plan_files)
instead of PyIceberg's Python manifest planning, then streams the planned tasks
through the same sharded, casted reader as the read path. Falls back to PyArrow
on any scan pyiceberg-core cannot handle.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The native path crashed on a normal REST+S3 catalog instead of degrading:
non-str FileIO props (the REST auth manager) reached the Rust binding, S3
path-style/region were never translated to the opendal keys, and the fallback
guard missed decode-time errors and schemeless local paths. Filter props to
strings, mirror PyArrow's path-style default and pass region, prime the first
batch so a decode mismatch falls back too, and skip native for bare paths.
Scoped to static-credential FileIO -- a refreshing auth manager is still not
carried to the native path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
plan_files() now resolves a ScanPlanner -- local manifest planning or REST
server-side, exactly as before -- or uses one injected on the scan. That
injection point is the seam a Rust or engine-specific planner plugs into
without touching any call site. The bundled RustScanPlanner is an opt-in,
deliberately-unimplemented stub: the native plan output exposes only
booleans/counts, not the residual, partition data, or deletes needed to rebuild
a faithful FileScanTask, so the fused native read path stays the supported
native route. Env flags and default resolution are unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@abnobdoss abnobdoss force-pushed the aba-158-opt-in-rust-arrow-scan branch from 60a6ee3 to deaa353 Compare June 6, 2026 16:47
@abnobdoss abnobdoss changed the base branch from aba-156-157-core-adapters to main June 6, 2026 17:02
@abnobdoss abnobdoss changed the title table: add opt-in pyiceberg-core arrow reader table: native scan and Arrow stream integration Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant